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We give an overview of statistical models and likelihood, to- 
gether with two of its variants: penalized and hierarchical likeli- 
hood. The Kullback-Leibler divergence is referred to repeatedly, 
for defining the misspecification risk of a model, for grounding 
the likelihood and the likelihood crossvalidation which can be used 
for choosing weights in penalized likelihood. Families of penalized 
likelihood and sieves estimators are shown to be equivalent. The 
similarity of these likelihood with a posteriori distributions in a 
Bayesian approach is considered. 
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1 Introduction 

Since its proposal by Fisher (1922), likelihood inference has occupied a cen- 
tral position in statistical inference. In some situations modified versions of 
the likelihood have been proposed. Marginal, conditional, profile and partial 
likelihoods have been proposed to get rid of nuisance parameters. Pseudo- 
likelihood and hierarchical likelihood may be used to circumvent numerical 
problems in the computation of the likelihood, generally due to multiple in- 
tegrals. Penalized likelihood has been proposed to introduce a smoothness 
a priori knowledge on functions, thus leading to smooth estimators. Several 
review have already been proposed, for instance Lee and Nelder (1992), but 
it is nearly impossible in a single paper to describe with some details all the 
types of likelihoods that have been proposed. This paper aims at describing 
the conventional likelihood and two of its variants: penalized and hierarchical 
likelihoods. The aim of this paper is not to give the properties of the esti- 
mators obtained by maximizing these likelihood but rather to describe these 
three likelihoods together with their link to the Kullback-Leibler divergence. 
This interest more turned to the foundations than to the properties, leads us 
to first develop some reflexions and definitions about statistical models and 
to give a slightly extended version of the Kullback-Leibler divergence. 

In section 2 we recall the definition of a density and the relationship be- 
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tween a density in the sample space and for a random variable. In section 3 we 
give a slightly extended version of the Kullbaclk-Leibler divergence (making 
it explicit that it also depends on a sigma-field). Section 4 gives an account 
of statistical models, distinguishing mere statistical families from statistical 
models, and defining the misspecification risk. Section 5 presents the like- 
lihood and discusses issues about its computation and the performance of 
the estimator of the maximum likelihood in terms of Kullback-Leibler risk. 
In section 6 we define the penalized likelihood and show that for a family 
of penalized likelihood estimators there is an identical family of sieves esti- 
mators. In section 7 we describe the hierarchical likelihood. In section 8 we 
briefly sketch the possible unification of these likelihoods through a Bayesian 
representation allowing to consider the maximum (possibly penalized) likeli- 
hood estimators as MAP estimators; this question however cannot be easily 
settled due to the non-invariance of the MAP for reparametrization. There 
is a short conclusion. 

2 Definition of a density 

Consider a measurable space (S, A) and two measures /i and v with \x abso- 
lutely continuous relatively to v. For Q a sub-cx-field of T the Radon-Nikodym 
derivative of /i with respect to v on X, denoted by: is the ^-measurable 
random variable such that 



The Radon-Nikodym derivative is also called the density. We are interested 
in the case where /i is a probability measure, that we will call P l ; v may 
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also be a probability measure, P°. In that case we can speak of likelihood 

pi ipO 

ratio and denote it Cg . In order to speak of likelihood function, we have 
to define a model (see section H|). Note that likelihood ratios (as Radon- 
Nykodym derivatives) are defined with respect to a sigma-field. If 7i and 
Q are different sigma-fields, fpTT^ and jjpsig are different, but if TC C Q the 
former can be expressed as a conditional expectation (given 7i) of the latter 
and we have the fundamental formula: 

~dP l 



Epo 



dP° 



n 



Q 



dP°\n 

Consider now the case where the measurable space (Q, J 7 ) is the sample 
space of an experiment and define a random variable X, a measurable func- 
tion from (Q, J 7 ) to (dt, B). We shall write in bold character a probability on 
(O,^ 7 ), for instance, P l . The couple (P X ,X) induces a probability measure 
on defined by: P X {B) = P 1 oX~ l (B),B G B. This probability mea- 

sure is called the distribution of X. If this probability measure is absolutely 
continuous with respect to Lebesgue (resp. counting) measure, one speaks of 
continuous (resp. discrete) variable. For instance, for a continuous variable 
we define the density fx = which is the usual probability density func- 
tion (p.d.f.). Note that the p.d.f. depends on both P 1 and X, while 

v ^ ' r v dP \X 

depends on X but not on a specific random variable X. Often in applied 
statistics one works only with distributions, but this may let some problems 
unsolved. 

Example 1. Consider the case where concentrations of CD4 lympho- 
cytes are measured. Q represents the set of physical concentrations that may 
happen. Let the random variables X and Y express the concentration in 
number of CD4 by mm 3 and by ml respectively. Thus we have Y = 10 3 X. 



So X and Y are different, although they are informationally equivalent. For 
instance the events {u : X(u) = 400} and {u : Y(u) = 400000} are the 
same. The densities of X and Y, for the same P 1 on (f2,jF), are obviously 
different. So, if we look only at distributions, we shall have difficulties to 
define rigorously what a model is. 

3 The Kullback-Leibler risk 

Many problems in statistical inference can be treated from the point of view 
of decision theory. That is, estimators for instance are chosen as minimiz- 
ing some risk function. The most important risk function is based on the 
Kullback-Leibler divergence: maximum likelihood estimators, use of Akaike 
criterion or likelihood crossvalidation can be gounded on the Kullback-Leibler 
divergence. Given a probability P 2 absolutely continuous with respect to 

a probability P 1 and X a sub-u-field of J 7 , the loss using P 2 in place 

P 1 /P 2 dP 1 

of P is the log-likelihood ratio L x = log—^ . Its expectation is 
P 1 /P 2 

Epi [L x ] . This is the Kullback-Leibler risk, also called divergence (Kull- 
back and Leibler, 1951; Kullback, 1959) or information deviation (Cencov, 
1972) or entropy (Akaike, 1973). The different names of this quantity reflects 
its central position in statistical theory, being connected to several fields of 
the theory. Several notations have been used by different authors. Here we 
choose the Cencov notation: 

I{P 2 \P 1 ;X) = E p i[L? /P2 ] 

If X is the largest sigma-field defined on the space we omit it in the notation. 
Note that the Kullback-Leibler risk is asymmetric and hence does not define 
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a distance between probabilities; we have to take on this fact. If X is a 
random variable with p.d.f. f x and f x under P 1 and P 2 respectively we 
have —^j = Ji (Y [ and the divergence of the distribution P x relative to P x 

dP \X Jx\ > 

can be written: 

AP 2 x\Px)= [log&(f\f x (x)dx. (1) 
J Jx \ x ) 

We have that I(P 2 \P 1 ;X) = X(P X \P X ). Note that on (SI, J 7 ) we have to 

specify that we assess the divergence on A 1 ; we might assess it on a different 

sigma-field and would of course obtain a different result. This gives more 

flexibility. In particular we shall use this in the case of incomplete data. The 

observation is represented by a sigma-field O. Suppose we are interested to 

make inference about the true probability on X . We have complete data 

if our observation is O — X. With incomplete data, in the case where the 

mechanism leading to incomplete data is deterministic, we have O C X . In 

that case it will be very difficult to estimate I(P 2 \P 1 ; X) and it will be more 

P 1 iP 2 

realistic to use I(P 2 \P 1 ;0) = Epi [Lq 1 ]. We need this flexibility to 
extend Akaike's argument for the likelihood and for developing model choice 
criteria to situations with incomplete data. 

Example 2. Suppose we are interested in modeling the time to an event, 
X, and we wish to evaluate the divergence of P 2 with respect to P l . It is 
natural to compute the divergence on the sigma-field X generated by X, 
I(P 2 \P l ; X) = I(P X \P X ) given by formula ([T|). Suppose however that ob- 
servation of X is right-censored at a fixed time C. We observe (X, 5) where 
X = min(X, C) and 5 = l{x<c}- Thus on {X < C} we observe all the events 
of X but on {X > C} we observe no more detailed event. If we represent the 
observation by the sigma-field O we can say that O is generated by (X, 5). It 
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is clear that we have O C X . Although in theory it is still interesting to com- 
pute the divergence of P 2 with respect to P 1 on the sigma-field X it is also 
interesting to compute it on the observed sigma-field, that is I(P 2 \P X \ O). 

It can be proved by simple probabilistic arguments that on {X < C} we 
have ^7 = TTTiTT and on \X > C\ we have = and thus 

dP \0 Jx^ x ) dP \Q b x^ c ) 

I{P 2 \P\0) = [ C h g ^f\ /i(x)dx + log§P S\C). 

J ° JX\ X ) a X\^) 

This will take all its importance in section [5] where P 1 will be the true 
unknown probability (denoted P*); the problem will not be to compute but 
to estimate this divergence. 

4 Statistical models and families 
4.1 Statistical families 

Consider a measure space (<S, A, /i). We consider a subset V of the prob- 
abilities on (S,A,fi). We shall call such a subset a family of probabilities. 
We may parametrize this family. Following Hoffmann- Jorgensen (1994) a 
parametrization can be represented by a function from a set with values in 
V: 9 — > P e . It is desirable that this function be one-to-one, a property linked 
to the identifiability issue which will be discussed later in this section. The 
parametrization associated to the family of probabilities V can be denoted 
n = (P e ; 6 G 9) and we have V = {P e ; 6 e ©}. We may denote IT ~ V. 
If 111 ~ V and H2 ~ V, IIi and II2 are two parameterizations of the same 
family of probabilities and we may note IIi ~ n 2 . 
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However we do not consider that a parametrized family on (9ft, B) repre- 
senting a family of distributions of a random variable is sufficient to specify 
a statistical model (here, we do not follow Hoffmann- Jorgensen, 1994). This 
is because the distributions depend on the random variables chosen, as ex- 
emplified in section [2J 

4.2 Statistical models 

A family of probabilities on the sample space of an experiment (Q, T) will be 
called a statistical model and a parametrization of this family will be called 
a parametrized statistical model. 

Definition 1 Two parametrized statistical models II = (P e ,0 £ 0) on X 
and IT = (P 7 ,7 £ T) on y are equivalent (in the sense that they specify 
the same statistical model) if X — y and they specify the same family of 
probability on (Q,X). 

The couple (X, II) of a random variable and a parametrized statistical model 
induces the parametrized family (of distributions) on (JR., £>): = (Pf-; 9 £ 
G). Conversely the couple (X, Tlx) induces II if X = T. In that case 
we may describe the statistical model by (X, Tlx)- Two different random 
variables X and Y induce two (generally different) parametrized families on 
(Jft, £>), and Uy- Conversely one may ask whether the couples (X, Tlx) 
and (Y, Yl Y ) define equal or equivalent parametrized statistical models. We 
need the definition of "informationally equivalent" random variables (or more 
generally random elements). 
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Definition 2 X and Y are informationally equivalent if the sigma-fields X 
and y generated by X and Y are equal. 

Each couple (X, P x ) induces a probability on (f2, X) P x ' e = P x oX and 
thus the couple (X, Ilx) induces the parametrized statistical model (P x ' e , 9 G 
G). Similarly each couple (Y,Py) induces a probability on (f2, y) P Ya = 
PyoY and the couple (Y, Hy) induces the parametrized statistical model 
(P Y,7 ,7 G T). Tautologically we will say that (X, Tlx) and (Y, Hy) define 
the same statistical models if (P x,e , 9 G 0) and (-P y ' 7 , 7 G T) are equivalent. 

Example 1 continued 

(i) Tlx = (A/"(10 3 ;a 2 ),a 2 > 0) and Ily = (AT(10 3 ; a 2 ), a 2 > 0) are the 
same parametrized families on (9ft, B). However if X and Y are measurements 
of the same quantity in different units, these parametrized families correspond 
to different statistical models. 

(ii) Tlx = a 2 ); fie 3ft, a 2 > 0) and Ily = (W(/i, a 2 ); fi G 3ft, a 2 > 0) 
are the same parametrized family on (3ft, £>). (X, Yl x ) and (Y, Ily) specify 
the same statistical model but not the same parametrized statistical model. 

(iii) U x = (A/"(10 3 ;cx 2 ),^ 2 > 0) and Ily = (Af(10 6 ; 10V 2 ), a 2 > 0) are 
different families on (3ft, B). However (X, Ux) and (Y, Ily) specify the same 
statistical model (with the same parametrization). 

For sake of simplicity we have considered distributions of real random 
variables. The same can be said about random variables with values in 3ft d 
or stochastic processes which are random elements with values in a Skoro- 
hod space. Commenges and Gegout-Petit (2007) gave an instance of two 
informationally equivalent processes. The events described by an irreversible 
three-state process X = (X t ), where X t takes values 0, 1, 2, can be described 
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by a bivariate counting process N = (Ni,N 2 ). The law of the three-state 
process is specified by the transition intensities aoi, 002, cti2- There is a way 
of expressing the intensities Ai and A2 of N% and N2 such that the laws of 
X and N correspond to the same probability on (Q, J 7 ) . Thus the same 
statistical model can be described with X or with N. 

4.3 Statistical models and true probability 

So-called objectivist approaches to statistical inference assume that there is 
a true, generally unknown, probability P*. Frequentists as well as objec- 
tivist Bayesians adopt this paradigm while subjectivist Bayesians such as De 
Finetti (1974) reject it. We adopt the objectivist paradigm which is more 
suited to answer scientific issues. Statistical inference aims to approach P* 
or functionals of P*. Model II is well specified if P* e II, mis-specified 
otherwise. If it is well specified there is a (9* G such that P e * = P*. If 
we consider a probability P e we may measure its divergence with respect to 
P* on a given sigma-field O by I(P \P*;0), and we may choose 9 which 
minimizes this divergence. We assume that there exist a value 9 opt which 
minimizes I(P 9 \P*;0). We call I(P 9opt \P*;0) the misspecification risk of 
model II. Of course if the model is well specified I(P e \P*; O) is minimized 
at 9* and the misspecification risk is null. 
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5 The likelihood 



5.1 Definition of the likelihood 

Conventionnally most statistical models assume that independently identi- 
cally distributed (iid) random variables, say Xi,i = 1, . . . ,n, are observed. 
However in case of complex observation schemes the observed random vari- 
ables become complicated; moreover the same statistical model can be de- 
scribed by different random variables. For instance, in Example 2 the ob- 
served random variables are the couples (Xi,Si). However we may also de- 
scribe the observation by (5iXi,5i), or in terms of counting processes by 
(-^ujO < u < C), where (iV* = l{ Xi <u})- These three descriptions are 
observationally equivalent, in the sense that they correspond to the same 
sigma-field, say Oi = a(X u 5i) = a(5 l X i ,5 i ) = a(N l u ,0 <u<C). We shall 
adopt the description of observations in terms of sigma-fields because it is 
more intrinsic. We shall work with a measure space (Q, J 7 ) containing all 
events of interest. For instance the observation of subject i, Oi, belongs to 
T . Saying that observations are iid means that the Oi are independent, that 
there is a one-to-one correspondence between Oi and Oy and that the re- 
strictions of P* to Oi, P* ,, are the same. We call O n the global observation: 
O n = V™ =1 (9j. Since we do not know P* we may in the first place reduce the 
search by restricting to a statistical model II and find a P e £ II close to P*, 
that is, one which minimizes I(P \P*; Oi). We have already given a name 
to it, pOopt but we cannot compute it directly because we do not know P*. 
The problem is that I(P \P*;Oi) doubly depends on the unknown P*: (i) 
through the Radon-Nikodym derivative; (ii) through the expectation. Prob- 
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p* /P B p* /P° p° /P° 

lem (i) can be eliminated by noting that L . = L . +L . . Thus, 
by taking expectation under P*: 

I(P e \P*; Oi) = I{P°\P*; Oi) - E p *(L^ /P ") 

P e /P° 

Minimizing I(P \P*;Oi) is equivalent to maximizing Ep*(L . ). We 
cannot compute Ep*(L . ) but we can estimate it. The law of large 
numbers tells us that: 



„ -l rP e /P° p (r P e /P° 

i=l 



Thus we may maximize the estimator on the left hand or equivalent ly the 
Qctior 



P /P° dP° 

likelihood function £ n = — ^ „ . The likelihood function is the func- 

iZ„ dP |U 



P /P 

tion 9 — > £q . In conclusion the maximum likelihood estimator (MLE) 
can be considered as an estimator which minimizes a natural estimator of 
the Kullback-Leiber risk. 



5.2 Computation of the likelihood 

Computation of the likelihood is simple in terms of the probability on the 
observed a-field. The conventional way of specifying a model is in terms of 
a random variable and a family of distributions (X, (/x(-)eee)- Then the 
likelihood for observation X is simply f^-(X). When the events of interest 
are represented by stochastic processes in continuous time, it is also possible 
to define a density and hence a likelihood function. See Feigin (1976) for 
diffusion processes and Jacod (1975) for counting processes. 

Two situations make the computation of the likelihood more complex. 
The first is when there is incomplete observation of the events of interest. A 
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rather general approach of this problem is through the concept of coarsening, 
and to make reasonably simple computation the concept of ignorability of 
the mechanism leading to incomplete data has been promoted (Heitjan and 
Rubin, 1991). This has been generalized to the stochastic process framework 
by Commenges and Gegout-Petit (2005) (which also give some general for- 
mulas for likelihood calculus). The second situation occurs when the law is 
described through a conditional probability and the conditioning events are 
not observed. This is the framework of random effects models. Although 
conceptually different these two situations lead to the same problem: the 
likelihood for subject i can be relatively easily computed for a "complete" 
observation Qi and the likelihood for the observation Oi C Q$ is the condi- 
tional expectation (which derives from the fundamental formula): 



The conditional expectation is expressed as an integral which must be com- 
puted numerically in most cases. The only notable exception is the linear 
mixed effects model where the integral can be analytically computed. For 
examples of algorithms for non-linear mixed effects see Delyon, Lavielle and 
Moulines (1999), Guedj, Thiebaut and Commenges (2007) and for general 
formulas for the likelihood of interval-censored observations of counting pro- 
cesses, Commenges and Gegout-Petit (2007). 



P"/P 



P"/P 



,0 




13 



5.3 Performance of the MLE in terms of Kullback- 
Leibler risk 

We expect good behaviour of the MLE 9 when the law of large numbers can 
be applied and when the number of parameters is not too large. Some cases 
of unsatisfactory behaviour of the MLE are reported for instance in Le Cam 
(1990). The properties of the MLE may not be satisfactory when the number 
of parameters is too large, and especially when it increases with n such as in 
an example given by Neymann and Scott (1948). 

To assess the performance of the MLE we can use a risk which is an 
extended version of the Kullback-Leibler risk: 

EKL(P § ,P*) = E P *(L%/ p0 ). 

The difference with the classical Kullback-Leibler risk is that here P e is ran- 
dom: so EKL(P e ,P*) is the expectation of the Kullkack-Leibler divergence 
between P 6 and P*. In parametric models (that is, G is a subset of W) it 
can be shown (Linhart and Zucchini, 1986; Commenges et al., 2008) that 

EKL(P e >*) = E p .[Lf /p6 ° pt } + in^Trg-V) + o^ 1 ), (2) 

where / is the information matrix and J is the variance of the score, both 
computed in 8 opt . This can be nicely interpreted by saying that the risk 
EKL(P , P*) is the sum of the misspecification risk Ep* [L x ] and the 

statistical risk |n _1 Tr(/ _1 J). Note in passing that if II is well specified we 

JD* I T3 opt ~ 

have E p *[Lx ] = and I = J, and thus EKL(P e , P*) = £ + o{n- 1 ). 
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6 The penalized likelihood 

There is a large literature on the topic: Good and Gaskin (1971), Wahba 
(1983), Sullivan (1988), Hastie and Tishirani (1990), Joly and Commenges 
(1999), Gu and Kim (2002) among others. Penalized likelihood is useful when 
the statistical model is too large to obtain good estimators, while conven- 
tional parametric models appear too rigid. A simple form of the penalized 
log-likelihood is 



where J(9) is a measure of our dislike of 9 and k weights the influence of this 
measure on the objective function. A classical example is when 9 = (a(.), (3), 
where a(.) is a function and (3 is a real parameter. J (9) can be chosen as 



In this case J (9) measures the irregularity of the function a(.). The maximum 
penalized likelihood estimator (MpLE) 9^ is the value of 9 which maximizes 
pl K (9). k is often called a smoothing coefficient in the cases where J (9) 
is a measure of the irregularity of a function. More generally, we will call 
it a meta-parameter. We may generalize the penalized log-likelihood by 
replacing kJ{9) by J(9,k), where k could be multidimensional. When k 
varies, this defines a family of estimators, (^; k > 0). k may be chosen by 
cross-validation (see section ED- 

There is another way of dealing with the problem of possibly too large 
statistical models, the so-called sieve estimators. Consider a family of models 
(P v )v>o where: 



pl tt (9) = log£ o -Kj(9). 




V u = (P d ;9e 6 : J{9) < u). 
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For fixed u, the MLE solves the constrained maximization problem: 

maxL^; subject to J(9) < v (3) 

Let us denote 9 V the MLE. When v varies this defines a family of sieve 
estimators: (9 u ;v > 0). 9 V maximizes the Lagrangian L e a — \[J(9) — v\ 
for some value of A. The Lagrangian superficially looks like the penalized 
log-likelihood function but an important difference is that here the Lagrange 
multiplier A is not fixed and is a part of the solution. If the problem is convex 
the Karush-Kuhn- Tucker conditions are necessary and sufficient. Here these 
conditions are 

J(»)<,;A>0;^-A^ = 0. (4) 

It is clear that when the observation O is fixed, the function k — > J{9 pl ) 
is a monotone decreasing function. Consider the case where this function is 
continuous and unbounded (when k — > 0). Then for each fixed v there exists a 
value, say k u , such that J(9p 1 u ) = v. Note that this value depends on O. Now, 
it is easy to see that 9 P ^ U satisfies the Karush-Kuhn- Tucker conditions (j4j), 
with A = k v . Thus if we can find the correct n v we can solve the constrained 
maximization problem by maximizing the corresponding penalized likelihood. 
However, the search for k v is not simple and we must remember that the 
relationship between v and k v depends on O. A simpler result, deriving 
from the previous considerations, is: 

pi 

Lemma 1 (Penalized and sieves estimators) The families {P° K ; k > 0) 
and (P e "; v > 0) are identical families of estimators. 

The consequence is that since it is easier to solve the unconstrained max- 
imization problem involved in the penalized likelihood approach one should 
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apply this approach in applications. On the other hand it may be easier to 
develop asymptotic results for sieve estimators (because 9 U is a MLE) than 
for penalized likelihood estimators. One should be able to derive properties 
of penalized likelihood estimators from those of sieve estimators. 

7 The hierarchical likelihood 

An important class of models arises when we define a potentially observ- 
able variable Yi for each subject and its distribution is given conditional on 
unobserved quantities. Specifically, let us consider the following model: con- 
ditionally on b % , Yi has a density fy\b(-] 9, b l ), where 9 is a vector of parameters 
of dimension m and b l are random effects (or parameters) of dimension K. 
The (Yi, b l ) are i.i.d. Typically Yi is multivariate of dimension rij. We as- 
sume that the b % have density /&(.;t), where r is a parameter. Typically Yi 
is observed while b l is not. This can be made more general for including the 
case of censored observation of Yj. This is the classical framework of random 
effects models. 

The conventional approach for estimating 9 is to compute the maximum 
likelihood estimators. Empirical Bayes estimators of the b % can be computed 
in a second stage. The likelihood is computed by taking the expectation of 
the conditional likelihood given the random effect. 

£ 0i = E(£ 0i | 6 ja). 

Practically the computation of this conditional expectation involves the inte- 
gral / fY\ b (Yi\b) fb(b)db. However, the computation of these multiple integrals 
of dimension K is a daunting task if K is larger than 2 or 3, especially if the 
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likelihood given the random effects is not itself very easy to compute: this is 
the curse of dimensionality. 

For hierarchical generalized linear models the hierarchical likelihood, or 
h-likelihood, was proposed by Lee and Nelder (1996, 2001); see also Lee, 
Nelder and Pawitan (2006). The h-likelihood is the joint likelihood of the 
observations and the (unobserved) random effects. Estimators (here denoted 
MHLE) of both 9 and b are obtained by maximizing the h-likelihood. In 
practice this is done by maximizing the h-loglikelihood: 

n 

hl{i) = I? Q -$>g t). 

— n i=i 

where Lj>. is the loglikelihood for the observation conditional on b, and 
7 = (9, b) is the set of all the "parameters". Lq is the likelihood computed 

-=—n 

as if b were ordinary parameters; in a conventional random-effect approach 
this would be considered as a conditional likelihood. Often the loglikelihood 
can be written Lq = Yh l°g f(Yi] 0, b l ). However this formulation is not 

——71. 

completely general, because there are interesting cases where observations of 
the Yi are censored. So we prefer writing the loglikelihood as Lj>. where 
O n represents the observed a-field for n subjects. We note 7 = (9, b) the 
maximum h-likelihood estimators of the parameters for given r; the latter 
(meta) parameter can be estimated by profile likelihood. The main interest 
of this approach is that there is no need to compute multiple integrals. This 
problem is replaced by that of maximizing hi (7) over 7, that is, a large 
number of parameters: this number is equal to m + nK. This may be large 
but special algorithms can be used for generalized linear models. 

Therneau and Grambsch (2000) used the same approach for fitting frailty 
models, calling it a penalized likelihood. It may superficially look like the 
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penalized quasi likelihood of Breslow and Clayton (1993) but this is not the 
same thing. There is a link with the more conventional penalized likelihood 
for estimating smooth functions discussed in section 6. The h-likelihood can 
be considered as a penalized likelihood but with two important differences rel- 
ative to the conventional one: (i) the problem is parametric; (ii) the number 
of parameters grows with n. Commenges et al. (2008) have proved that the 
maximum h-likelihood estimators for the fixed parameters are M-estimators 
(see van der Vaart, 1998). Thus under some regularity conditions they have 
an asymptotic normal distribution. However, this asymptotic distribution is 
not in general centered on the true parameter values so that the estimators 
are biased. In practice the bias can be negligible so that this approach can 
be interesting in some situations due to its relative numerical simplicity. 

8 The likelihood cross-validation criterion 

An important issue is the choice between different estimators. Two typical 
situations are : (i) choice of MLE's in different models; (ii) choice of MpLE's 
with different penalties. If we consider two models II and II' we get two 
estimators P e and P^ of the probability P* and we may wish to assess which 
is better: this is the "model choice" issue. A penalized likelihood function 
produces a family of estimators (P 9 * ; k > 0) and we may wish to choose 
the best. Here what we call "the best" estimator is the estimator which 
minimizes some risk fucntion; in both cases we can use the extended version 
of the Kullback-Leibler risk already used in section 5: 

EKL(P^P*) = Ep,(LS7 p8 ). 
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Since P* is unknown we can first work with EKL(P e ,P°) = E P *(L^/ F ), 
which is equal, up to a constant, to EKL(P e ,P*). Second we can, as usual, 
replace the expectation under P* by expectation under the empirical distri- 
bution. In parametric models Akaike (1973) has shown that an estimator of 
EKL(P^,P°) was — n -1 (£Q./ P — p). The AIC criterion can be deduced by 
multiplying by 2n. The result can be used to estimate the difference of risks 
between two estimators in parametric models A(P e ,P^) = EKL(P 6 ',P*) — 
EKL(P^,P*) by the statistic D(P § ,P^) = (l/2n)(AIC(P § ) - AIC(F*)) 
and a more refined analysis of the difference of risks can be developed, as in 
Commenges et al. (2008). 

9 Link withe MAP estimator 

One important issue is the relationship between the three likelihoods con- 
sidered here and the Bayesian approach. The question arises because it 
seems that these three likelihoods can be identified with the numerator of 
a posteriori distributions with particular priors. Thus MLE, MpLE and 
MHLE could be identified with the maximum a posteriori (MAP) estima- 
tors with the corresponding priors. However, this relationship depends on 
the parametrization. Thus the MLE is identical to the MAP using a flat 
prior for the parameters; if we change the parametrization, the flat prior on 
the new parameters does not correspond to the flat prior on the original pa- 
rameters, as was already noticed by Fisher (1922). This apparent paradox 
led Jeffreys to propose a prior invariant for parametrization (Jeffreys, 1961), 
known as Jeffrey's prior. However the MAP with Jeffreys's prior is no longer 
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identical to the MLE when Jeffreys's prior is not flat. For instance for the 
parameter of binomial trial Jeffreys's prior is 1/ yjp{\ — p). Adding the log- 
arithm of this term to the loglikelihood shifts the maximum away from 0.5. 
Moreover it is questionable whether this invariance property can be identified 
with a non-informativeness character of this prior (for a review on the choice 
of priors, see Kass and Wasserman, 1996). 

In the Bayesian paradigm, rather than considering estimators based on 
maximization of some expression such as the likelihood or posterior density, 
it is common to attempt to summarize the statistical inferences by using 
qunatiles of the posterior distribution, such as the median, or expectations 
with respect to the posterior. While such expectation may be more satis- 
factory, they typically involve multiple integrals which are hard to compute: 
computations are mostly being done with the MCMC algorithm. Maximiza- 
tion methods have the advantage of being potentially easier in the case where 
multiple integrals can be avoided. There are also approximate Bayesian 
methods which yield the a posteriori marginal distribution by approximating 
some of the multiple integrals by Laplace approximation, which in turn in- 
volves a maximization problem: Rue et al., (2008) claim that this approach 
is much faster than the MCMC algorithm. 

Conclusion 

The Kolmogorov representation of a statistical experiment has to be taken 
seriously if we want to have a deep understanding of what a statistical model 
is. The Kulback-Leibler risk is underlying most of the reflexions about likeli- 
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hood, as was clearly seen by Akaike (1973). Finally the link with the Bayesian 
approach should be explored more thoroughly than could done in this paper. 
The MLE and MAP estimators are the same if, in a given paramtrization, 
the prior used for the MAP is flat. However, this identity does not resist 
to a reparametrization. Similar remarks hold for the link between penalized 
likelihood and MAP. 
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